%% note %% = internal notes

%% Read in all scripts here %%

%% cache all inputs and uncache when scripts are updated %%

# admin
source("~code/01 - Admin (Closed).R")
source("~code/02 - Census Vars and Functions (Closed).R")

# Austin
source("~code/Austin/AU - 02 - Clean Scooter Data (Open).R")
source("~code/Austin/AU - 21 - Collect and Clean Census Data (Open).R")

# Chicago
source("~code/Chicago/CH - 01 - Read Scooter Data (Closed).R")
source("~code/Chicago/CH - 20 - Collect Census Data (Open).R")
source("~code/Chicago/CH - 21 - Clean Census Data (Open).R")

# DC

# Kansas City

# Louisville
source("~code/Louisville/LV - 01 - Read Scooter and Base Map Data (Closed).R")
source("~code/Louisville/LV - 02 - Clean Scooter Data (Closed).R")
source("~code/Louisville/LV - 03 - Prep Rebalance Data - User Events (Closed).R")
source("~code/Louisville/LV - 04 - Prep Rebalance Data - Rebalance Events (Closed).R")
source("~code/Louisville/LV - 20 - Collect Census Data (Closed).R")
source("~code/Louisville/LV - 21 - Clean Census Data (Closed).R")

# Minneapolis

# Model
source("~code/Model/Model - 01 - Setting up (Open).R")

1. Introduction

1.1 About this Project

The following document presents an analysis of shared, dockless electric scooter systems in several American cities and a web tool for predicting scooter demand in cities that do not currently have shared scooters. We focus in particular on the equity implications of these systems: who currently has access to scooters, and who will have access if we keep following the business-as-usual approach? This document presents an overview of our data and use case, a summary and key takeaways from our analysis, and an appendix with all of the R code necessary to replicate our work.

This project was produced for the MUSA/Smart Cities Practicum course (MUSA 801) taught by Ken Steif, Michael Fichman, and Matt Harris in the Master of Urban Spatial Analytics and Master of City Planning Programs at the University of Pennsylvania. We are deeply grateful to our instructors for their guidance, feedback, and attention throughout the semester, despite the challenges brought on by the ongoing pandemic. We also thank Michael Schnuerle from the City of Louisville Metro Government and Sharada Strasmore from the DC Department of Transportation for providing data that made our rebalancing analysis possible as well as sharing their insights into and knowledge of the scooter and micromobility planning process. Lastly, we would like to acknowledge our classmates in MUSA and city planning, who not only produced incredible projects of their own this semester, but also provided thoughtful feedback and support throughout our time in the programs.

1.2 Abstract

In the few short years since they first launched, shared, dockless electric scooters have become ubiquitous sights on streets and sidewalks in cities across America. What may have first been seen as novelties or purely recreational vehicles now play critical roles in many people’s daily transportation routines. Despite being relative newcomers to the urban transportation system, dockless scooters provided over 38 million trips in 2018, more than the number of rides taken on traditional station-based bikeshare systems that year. Yet, while electric scooters have quickly enmeshed themselves in the urban fabric, access to these vehicles is not spread equitably across cities. While residents in wealthier, predominantly white downtown neighborhoods enjoy easy access to shared scooters, residents in poorer but comparably dense parts of cities outside of downtown are underserved by the systems.

In this study, we use a combination of open and private dockless scooter usage data from six American cities to construct a model for predicting ridership in ten cities that have not had scooter share systems in the past. Our model shows that the business-as-usual approach to introducing scooters into a new market is likely to create inequitable access to the vehicles for residents. While cities such as Louisville, KY have recognized these inequities and instituted distribution requirements to address them, we show through analysis of vehicle rebalancing data that providers do not seem to be complying with these requirements, and stronger enforcement may be necessary. Lastly, we introduce a proof-of-concept web application that allows users to explore the spatial distribution of our model’s predictions for each city and compare them to demographic and socioeconomic variables of interest. We believe that this tool will allow policymakers to anticipate the geography of scooter ridership in their cities and understand - and ultimately plan for - the inequities created by the business-as-usual approach to launching and administering scooter share systems.

1.3 Motivation

Since Bird and Lime launched the first shared, dockless electric scooter services in Santa Monica, California in September 2017, scooters have rapidly spread across American cities, becoming a popular form of urban transportation. As of January 2020, there are 340 scooter share programs operating in 242 municipal areas and campuses across 40 different states (plus Washington D.C.). In 2018 alone, users took 38.5 million trips on electric scooters, more than the number of trips taken on more familiar, traditional station-based bikeshare systems. While scooter share providers initially entered new municipalities and markets without local officials’ permission or oversight, leading to spikes in scooter-related injuries and complaints of vehicles blocking sidewalks, cities have begun collaborating through coalitions like the Open Mobility Foundation to institute some oversight over these programs. Many municipalities are now working with their scooter providers to ensure that their scooter share programs, among other goals, meet safety standards, distribute vehicles equitably, keep sidewalks clear, and protect rider privacy. Data standards like the Mobility Data Specification (MDS), created by the Los Angeles Department of Transportation, help cities share and monitor scooter ridership data and make sure that providers are complying with their policies.

While these data initiatives help cities manage more mature scooter share programs, there are no widely adopted models or processes in place that help cities without shared scooters introduce the vehicles into their markets. Further, while some cities like Chicago have issued citations to enforce their requirements for equitable distribution, not all cities have done so, meaning that in some places, these distribution requirements are without teeth. As we see in our analysis of vehicle distribution in Louisville, scooter companies do not necessarily comply with existing distribution requirements. In this project, we use data from six American cities with shared scooters to develop a model that estimates what peak-season demand will be in cities without existing programs. We use this model to build a prototype for a web application intended to help city officials anticipate the geography of scooter ridership in their cities and understand its relationship to the city’s social and economic geography. Our goal is to create a municipal scooter planning toolkit that helps cities interested in launching scooter share systems learn from other municipalities that already have these systems in place. We hope that cities like Philadelphia, Pennsylvania and Madison, Wisconsin, which are considering adopting scooter share programs, will find this toolkit helpful as they work with providers to bring the vehicles to their communities.

1.4 Summary

Using a combination of publicly available and private scooter ridership data from six American cities, we use machine learning methods to create a model that predicts the total scooter trips that will be taken between July and September in each census tract in 10 cities that do not currently have scooter share programs. Our model uses a total of 24 features encompassing demographic, socioeconomic, and built environment characteristics for the cities to make its predictions. We emphasize that our model predictions reflect both the underlying demand for scooters that may exist in a census tract as well as the impact of the scooter companies’ fleet management and distribution choices. Our model uses existing ridership data to predict how scooter usage would look in a new city if it were to follow the business-as-usual approach.

Based on our model predictions, we find that the business-as-usual approach will likely lead to inequitable access in new cities. Census tracts with high predicted ridership tend to be %%[…]%%, whereas tracts with lower ridership tend to be %%[…]%%. We propose combining these socioeconomic indicators into a single Equity Score, which cities could customize to their own priorities and rely on while planning and launching a scooter share system or policy.

2. Data

2.1 Outcome Variable and Unit of Analysis

For our outcome variable, we use the total number of rides taken between July and September of 2019 in each census tract for each city. We chose this time period partly due to data limitations - Chicago only recently instituted scooter share and does not have a full year of data available - and partly because late summer and early fall represent peak ridership. We chose census tracts as our spatial unit of analysis because they represent the highest level of geographic aggregation in the scooter ridership datasets. While the private Louisville and Washington, D.C. datasets provide coordinates for ride pick-ups and drop-offs, Austin’s publicly available dataset aggregates rides to the pick-up and drop-off census tract to protect rider privacy.
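The counting step behind this outcome can be sketched in a few lines of base R. This is a minimal illustration on made-up rides, assuming each ride already carries the ID of its origin tract; the `trips` table and its column names are hypothetical, and treating the result as the `ORIGINS_CNT` column is an assumption about how that column was built.

```r
# Hypothetical trip records: one row per ride, already tagged with its origin tract
trips <- data.frame(
  trip_id      = 1:6,
  origin_tract = c("A", "A", "B", "A", "B", "C"),
  start_time   = as.POSIXct(c("2019-06-30 23:50:00", "2019-07-01 08:15:00",
                              "2019-08-12 17:40:00", "2019-09-30 21:05:00",
                              "2019-10-01 00:10:00", "2019-07-04 12:00:00"),
                            tz = "UTC"),
  stringsAsFactors = FALSE
)

# Keep rides that start on or after July 1 and before October 1, 2019
window_start <- as.POSIXct("2019-07-01 00:00:00", tz = "UTC")
window_end   <- as.POSIXct("2019-10-01 00:00:00", tz = "UTC")
in_window    <- trips$start_time >= window_start & trips$start_time < window_end
study_trips  <- trips[in_window, ]

# Count rides per origin tract (assumed to correspond to ORIGINS_CNT)
origins <- as.data.frame(table(origin_tract = study_trips$origin_tract),
                         responseName = "ORIGINS_CNT",
                         stringsAsFactors = FALSE)
origins
```

Using a half-open window (on or after July 1, before October 1) avoids double counting rides that start exactly at midnight on a boundary date.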

Beyond the level of geographic aggregation, the ridership datasets also differ in the time periods they cover, their temporal precision, and the fields they provide.

| City | Geographic Aggregation | Time Period Available | Temporal Precision | Other Info | Fleet/Rebalancing Info |
|---|---|---|---|---|---|
| Louisville | Coordinates | Nov. 2018 - Dec. 2019 | Actual time | Trip ID, Vehicle ID, Battery Level, Operator | Rebalancing, Vehicle Maintenance/Retirement/Entry |
| Washington, DC | Coordinates | | Actual time | Trip ID, Vehicle ID, Trip Distance, Trip Duration, Operator | |
| Austin | Census Tract | April 2018 - Present | 15 minutes | Trip ID, Vehicle ID, Trip Distance, Trip Duration, Council District | No |
| Minneapolis | Street | May 2019 - Sept. 2019 | 30 minutes | Trip ID, Vehicle ID, Trip Distance, Trip Duration | No |
| Kansas City | Truncated Coordinates | June 2019 - Dec. 2019 | 15 minutes | Trip ID, Vehicle ID, Trip Distance, Trip Duration | No |
| Chicago | Census Tract | June 2019 - Sept. 2019 | Hour | Trip ID, Vehicle ID, Trip Distance, Trip Duration, Community Area Name | No |

Part of our data wrangling process was transforming the ridership data to a common level of spatial aggregation. Chicago’s data, for instance, was already aggregated at the census tract level, so it did not require any additional aggregation.

Louisville and DC, on the other hand, provided point data, which we aggregated to the census tract level.

2.2 Explanatory Variables

For our model features, we use variables from the US Census Bureau and OpenStreetMap that we believe would reflect both the underlying demand for scooters in a census tract and the likelihood that a provider would make more vehicles available in a tract.

Demographic

  • Total Population
  • Median Age
  • Percentage White Population
  • Percentage Female Population

Socio-economic

  • Household Income
  • Home Values and Rental Prices
  • Commute Mode Share (transit vs. driving)
  • Commute Time (30+ minutes)
  • Housing Units and Occupancy Rates
  • Vehicle Ownership
  • Jobs

Built Environment

  • Retail Stores
  • Restaurants
  • Leisure Activities and Tourism Destinations
  • Transportation Infrastructure
  • Offices

Our final model uses 24 features built from these variables that served as useful predictors for scooter ridership in a census tract. A sample of our data panel is shown below:

ORIGINS_CNT TOTPOP TOTHSEUNI MDHHINC MDAGE MEDVALUE MEDRENT PWHITE PTRANS PDRIVE PFEMALE PCOM30PLUS POCCUPIED PVEHAVAI RATIO_RETAIL RATIO_OFFICE RATIO_RESTAURANT RATIO_PUBLIC_TRANSPORT RATIO_LEISURE RATIO_TOURISM RATIO_COLLEGE RATIO_CYCLEWAY RATIO_STREET JOBS_IN_TRACT WORKERS_IN_TRACT
349 2802 80 28490 35.8 47300 703 0.8183440 0.0423620 0.8844673 0.4857245 0.0483450 0.7001634 0.8459564 0.0102041 0.0102041 0.0102041 0.1326531 0.8469388 0.0102041 0.0102041 0.0183206 0.0150866 991 1059
138 2399 80 25673 40.6 48000 710 0.5043768 0.0842956 0.7297921 0.5518966 0.0535211 0.8185841 0.8937644 0.1836735 0.0102041 0.0102041 0.1836735 0.7346939 0.0102041 0.0102041 0.0136005 0.0103508 456 1052
76 4612 150 29733 39.5 75600 804 0.1986123 0.0582278 0.8449367 0.5615785 0.0552426 0.8496132 0.9468354 0.0102041 0.0102041 0.0102041 0.1224490 2.6530612 0.0102041 0.0102041 0.0276389 0.0171585 75 1913
164 1790 100 25435 34.6 50200 708 0.0212291 0.1343931 0.6734104 0.5122905 0.0557905 0.7820372 0.8294798 0.0102041 0.0102041 0.0102041 0.0306122 0.0102041 0.0102041 0.0102041 0.0216312 0.0115308 579 720
56 2724 80 19746 35.6 49800 778 0.1119677 0.1368984 0.7358289 0.5499266 0.0512122 0.8283358 0.8278075 0.0102041 0.0102041 0.0102041 0.0204082 0.3673469 0.0102041 0.0102041 0.0113234 0.0065086 57 1124
56 2152 100 35625 35.7 74500 664 0.0185874 0.0992366 0.8636859 0.5836431 0.0480070 0.7626208 0.8822246 0.0102041 0.0102041 0.0102041 0.0918367 1.0102041 0.0102041 0.0102041 0.0139273 0.0069871 49 951
26 2022 100 20500 38.4 57600 567 0.0558853 0.1664296 0.8065434 0.5158259 0.0518519 0.8058252 0.8449502 0.0102041 0.0102041 0.0102041 0.0102041 0.0102041 0.0102041 0.0102041 0.0057834 0.0041512 31 800
49 2729 80 23533 30.3 57000 726 0.0974716 0.1342952 0.6082131 0.5844632 0.0584891 0.7244656 0.7913430 0.0102041 0.0102041 0.0102041 0.0102041 0.0816327 0.0102041 0.0102041 0.0190098 0.0083222 1290 1030
34 3075 100 38145 40.9 75800 695 0.0344715 0.1090742 0.8157654 0.5479675 0.0518346 0.8011118 0.8900092 0.0102041 0.0102041 0.0102041 0.0102041 0.0102041 0.0102041 0.0102041 0.0199580 0.0096401 64 1482
15 3202 90 31000 34.0 69800 853 0.0062461 0.1282985 0.6451319 0.5908807 0.0432909 0.8966038 0.8917197 0.0102041 0.0102041 0.0102041 0.0102041 0.5816327 0.0102041 0.0102041 0.0145422 0.0136271 1077 1325

3. Exploratory Analysis and Feature Engineering

3.1 Scooter Ridership Data

A map of the 6 cities shows that most rides originate and end in a small number of census tracts.

[Figure: Scooter Trips by City. Panels: Louisville; Washington, DC; Austin; Minneapolis; Kansas City; Chicago. Inflow data is not available for Minneapolis.]

A persistent problem in micromobility programs is unbalanced vehicle flow, which occurs when riders take more vehicles away from a place than other riders bring in. Which tracts are “gaining” and “losing” vehicles from user activity alone? While, of course, many rides begin and end within the same census tract, we see below that user activity leads to unbalanced flows. Without active rebalancing from providers, vehicles would become concentrated in just a few tracts, and user demand in other tracts would go unsatisfied.

The plots on the left show the net inflow/outflow of vehicles for each census tract during the study period. The maps on the right show this net inflow as a share of the tract’s total inflow; a tract that gained a net of 10 vehicles while seeing a total inflow of 20 vehicles would have an inflow rate of 0.5.
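The net flow and inflow rate described above can be computed as follows. This is a minimal sketch on made-up rides; the `rides` table, its column names, and the `inflow_rate()` helper are hypothetical and are not taken from our scripts.

```r
# Net inflow as a share of total inflow; undefined when a tract has no inflow
inflow_rate <- function(inflow, outflow) {
  ifelse(inflow > 0, (inflow - outflow) / inflow, NA_real_)
}

# Hypothetical rides between three tracts (origin -> destination)
rides <- data.frame(
  origin      = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "B"),
  destination = c("B", "B", "A", "B", "A", "B", "B", "A", "B", "B"),
  stringsAsFactors = FALSE
)

tracts <- sort(unique(c(rides$origin, rides$destination)))

# Inflow: rides ending in the tract; outflow: rides starting in the tract
flows <- data.frame(
  tract   = tracts,
  inflow  = as.vector(table(factor(rides$destination, levels = tracts))),
  outflow = as.vector(table(factor(rides$origin, levels = tracts))),
  stringsAsFactors = FALSE
)
flows$net_inflow  <- flows$inflow - flows$outflow
flows$inflow_rate <- inflow_rate(flows$inflow, flows$outflow)
flows

# The example from the text: a net gain of 10 vehicles on a total inflow of 20
inflow_rate(inflow = 20, outflow = 10)
```

Because every ride leaves one tract and enters one tract, the net inflows across all tracts sum to zero, which is a quick sanity check on the aggregation.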

[Figure: Net Scooter Flows by City. Panels: Louisville; Washington, DC; Austin; Minneapolis; Kansas City; Chicago. Net flow data is not available for Minneapolis.]

Some of the datasets include information on ride durations and distances. We do not investigate those data here, as they were not pertinent to our prediction model, but we do explore them later on when we discuss compliance with distribution requirements.

3.2 Feature Variables

The six cities we’ve chosen for the analysis vary greatly in size and demographic and socioeconomic characteristics. This makes producing a model that predicts raw trip counts a difficult challenge, but it also protects against the possibility of our model overfitting to a certain type of city.

[Figure: Distributions by City. Panels: Demographic; Socio-economic.]

During the feature engineering process, we experimented with several variations of each built environment variable. Using restaurants as an example, we tried the following:

  • Density: The number of restaurants per square mile in the tract
  • Count: The total number of restaurants in the tract
  • KNN: The distance from the tract centroid to the nearest k restaurants (where we experimented with a range of k values)
  • Ratio: The percentage of the city’s restaurants located within that tract
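The Count, Density, and Ratio variants can be sketched on a toy table as follows, along with the kind of correlation check one could use to compare them. The data, column names, and two-city setup are hypothetical, and the KNN variant is left out because it needs point locations rather than tract totals.

```r
# Hypothetical tract-level restaurant counts, land areas, and pickups in two cities
tracts <- data.frame(
  city        = c("X", "X", "X", "Y", "Y", "Y"),
  restaurants = c(30, 15, 5, 300, 80, 20),
  area_sqmi   = c(0.5, 1.0, 2.0, 1.0, 2.0, 4.0),
  pickups     = c(900, 300, 50, 1000, 200, 40),
  stringsAsFactors = FALSE
)

# Count: restaurants in the tract
tracts$count <- tracts$restaurants
# Density: restaurants per square mile
tracts$density <- tracts$restaurants / tracts$area_sqmi
# Ratio: the tract's share of its own city's restaurants
tracts$ratio <- tracts$restaurants / ave(tracts$restaurants, tracts$city, FUN = sum)

# Pearson correlation of each variant with pickups, pooled across cities
variant_cor <- sapply(tracts[, c("count", "density", "ratio")],
                      function(x) cor(x, tracts$pickups))
variant_cor
```

Within a single city, Count and Ratio are the same variable up to a constant, so they only behave differently once tracts from cities of different sizes are pooled. The Ratio version puts a large and a small city on the same scale, which is one plausible reason it could travel better across cities.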

Ultimately, we selected the Ratio versions, because those displayed the greatest correlation with user pickups in each tract. Below, we see the correlation plots for every feature variable in our analysis with the number of pickups in each tract.

[Figure: Correlation Plots. Panels: Demographic; Socio-economic; Built Environment.]

Final Features

Our final model included the following 24 features:

| Category | Variable | Description |
|---|---|---|
| Demographic | TOTPOP | Total population |
| Demographic | MDAGE | Median age |
| Demographic | PWHITE | % of the population that is white |
| Demographic | PFEMALE | % of the population that is female |
| Socio-economic | MDHHINC | Median household income (2018 dollars) |
| Socio-economic | MEDVALUE | Median home value |
| Socio-economic | MEDRENT | Median rent |
| Socio-economic | PTRANS | % of the population that takes transit to work |
| Socio-economic | PDRIVE | % of the population that drives to work |
| Socio-economic | PCOM30PLUS | % of the population with a commute >30 minutes |
| Socio-economic | TOTHSEUNI | Total housing units |
| Socio-economic | POCCUPIED | Housing occupancy rate |
| Socio-economic | PVEHAVAI | % of the population that owns a vehicle |
| Socio-economic | JOBS_IN_TRACT | The number of jobs located in this tract |
| Socio-economic | WORKERS_IN_TRACT | The number of workers who live in this tract |
| Built Environment | RATIO_RETAIL | % of the city’s retail found in this tract |
| Built Environment | RATIO_OFFICE | % of the city’s offices found in this tract |
| Built Environment | RATIO_RESTAURANT | % of the city’s restaurants found in this tract |
| Built Environment | RATIO_PUBLIC_TRANSPORT | % of the city’s public transportation found in this tract |
| Built Environment | RATIO_LEISURE | % of the city’s places for leisure activity found in this tract |
| Built Environment | RATIO_TOURISM | % of the city’s tourist attractions found in this tract |
| Built Environment | RATIO_COLLEGE | % of the city’s university buildings found in this tract |
| Built Environment | RATIO_CYCLEWAY | % of the city’s cycle infrastructure found in this tract |
| Built Environment | RATIO_STREET | % of the city’s streets found in this tract |

Below, we display a correlation matrix of the final features:

%%[corr plot takeaways]%%

4. Case Study: Louisville Rebalancing Compliance

Like many cities with scooter share, the City of Louisville has imposed vehicle caps and distribution requirements on its providers to ensure that scooter companies do not flood high-traffic areas of the city with unused vehicles and to promote equitable access to the vehicles across neighborhoods. Louisville’s scooter policy is summarized below and can be found in full here.

Policy:

  • Distribution Requirements

    • “To ensure access to shared mobility transportation options throughout the community, Metro has established distribution zones. Distribution zones are intended to ensure that no singular zone is intentionally over-served or under-served. Operators must comply with distributional requirements. Failure to comply with this provision constitutes a breach of the license and may result in the assessment of fleet size reductions, suspension, or even termination of the license. The duration of any suspension shall be at the sole discretion of Metro but will be no less than 6 months. Terminations shall apply for 1 year.”

    • For operators with 150 permitted vehicles or fewer, there are no distributional requirements.

    • For operators with permitted fleets ranging in size between 150 and 350 vehicles, 20% of each operator’s vehicles must be located within zones 1 and 9.

    • Distribution plans within Zones 1 and 9 must be submitted to Metro for approval to ensure adequate accessibility for residents of each zone has been achieved.

    • For fleets ranging in size between 350 and 1050 vehicles, 20% of each operator’s vehicles must be located within zones 1 and 9 and 10% must be in zone 8.

    • Distribution plans within Zones 1, 8, and 9 must be submitted to Metro for approval to ensure adequate accessibility for residents of each zone has been achieved.

  • Current Vehicle Limits:

    • Bird - 450 max vehicles/day - launched August 2018
    • Lime - 450 max vehicles/day - launched November 2018
    • Bolt - 150 max vehicles/day - launched July 2019
    • Spin - 150 max vehicles/day - launched August 2019
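The tiered distribution requirements above can be written as a small rule. This is a sketch under stated assumptions: the helper name `meets_distribution_rule()` is hypothetical, the 20% share is read as zones 1 and 9 combined, and a fleet of exactly 350 vehicles is treated as belonging to the lower tier, since the policy text does not say which tier that boundary falls in. Fleets above 1,050 vehicles are not covered by the summary and are handled here like the upper tier.

```r
# Returns TRUE when a fleet meets the distribution requirements summarized above.
# fleet_size: number of permitted vehicles
# share_1_9:  fraction of the fleet in zones 1 and 9 combined (assumed reading)
# share_8:    fraction of the fleet in zone 8
meets_distribution_rule <- function(fleet_size, share_1_9, share_8) {
  needs_1_9 <- fleet_size > 150  # 20% rule applies above 150 vehicles
  needs_8   <- fleet_size > 350  # 10% zone 8 rule applies above 350 (assumed boundary)
  ok_1_9 <- !needs_1_9 | share_1_9 >= 0.20
  ok_8   <- !needs_8   | share_8   >= 0.10
  ok_1_9 & ok_8
}

meets_distribution_rule(150, 0.00, 0.00)  # small fleet: no requirement applies
meets_distribution_rule(300, 0.25, 0.00)  # mid-size fleet with 25% in zones 1 and 9
meets_distribution_rule(450, 0.25, 0.05)  # large fleet short of the zone 8 share
```

The function is vectorized, so it can be applied to a whole column of fleet sizes and zone shares at once.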

For privacy reasons, most cities (including Louisville) only post geographically aggregated user ride data on their public open data sites. These datasets, while helpful for identifying broad trends in ridership, can lack the geographic resolution to tell us where exactly riders are going. Additionally, the datasets typically do not include any information on vehicle movements other than user rides, meaning we cannot discern where providers are adding vehicles to or removing them from the fleet through maintenance or rebalancing activity. For our analysis, the City of Louisville shared their providers’ status changes dataset (the “Rebalancing Data”), a non-public dataset that, in addition to user ride data, includes other vehicle events such as rebalancings and maintenance. These data points are also fully disaggregated. Whereas Louisville’s public scooter dataset rounds location coordinates to the third decimal place, the Rebalancing Data includes the raw coordinates.

Using the Rebalancing Data, we investigate whether Louisville’s two largest scooter suppliers, Bird and Lime, which are its only suppliers subject to the distribution requirements, have been complying with the city’s policy. At distinct points in time, are Bird’s and Lime’s scooters distributed across Louisville in compliance with the distribution requirements?

A map of zones 1, 8, and 9, which must receive a percentage of Bird and Lime’s daily fleet as part of the distribution requirements, is shown below.

In its raw form, however, the Rebalancing Data is not well-suited for answering this question. The dataset is currently organized around events, where each row is a status change event (the reason column) for a particular vehicle (vehicleId) that took place at a certain location and time (occurredAt), making it difficult to develop an aggregate picture for how each operator’s vehicles are distributed across the city at any point in time. The dataset tells us about vehicle flows, but we need information on the vehicle fleet.

## Observations: 856,700
## Variables: 10
## $ id                 <chr> "ed349118-ccd9-4fbe-8b3d-311b10dc223e", "4c4d40d...
## $ url                <chr> "https://www.li.me/", "https://www.li.me/", "htt...
## $ type               <chr> "reserved", "unavailable", "removed", "available...
## $ reason             <chr> "user pick up", "maintenance", "rebalance pick u...
## $ location           <chr> "POINT(-85.740319 38.256947)", "POINT(-85.76038 ...
## $ operators          <chr> "Lime Louisville", "Lime Louisville", "Bolt Loui...
## $ vehicleId          <chr> "63f13c48-34ff-49d2-aca7-cf6a5b6171c3-516765", "...
## $ occurredAt         <dttm> 2019-05-27 05:52:36, 2019-11-15 12:00:35, 2019-...
## $ vehicleType        <chr> "scooter", "scooter", "scooter", "scooter", "sco...
## $ vehicleEnergyLevel <dbl> 0.3500, 0.7800, 0.8133, 0.8200, 1.0000, 0.7900, ...

We solve this problem by selecting the most recent event for each vehicle in the dataset (prior to the selected audit time), finding each vehicle’s location, and assessing whether it is available for users. Then, we can aggregate the available scooters by distribution zone to determine whether the scooter providers are in compliance.

The Rebalancing Data contains 11 different status change events. We aggregate these events into three categories:

  • Active: Scooters whose most recent event was an Active event are available for users to ride.
  • Reserved: These scooters are currently being used by a rider.
  • Inactive: These scooters are not available to users. We consider them removed from the vehicle fleet.
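One way to picture this grouping is as a lookup table. The sketch below is illustrative only: it keys on the `type` values visible in the glimpse output above (`available`, `reserved`, `unavailable`, `removed`) and assumes they line up with our three categories, whereas the analysis code works from a list of Active `reason` values (`LV_active_status`). The `status_lookup` and `events` objects are hypothetical.

```r
# Assumed lookup from the dataset's `type` values to the three categories
status_lookup <- c(available   = "Active",
                   reserved    = "Reserved",
                   unavailable = "Inactive",
                   removed     = "Inactive")

# Hypothetical events using the `type` values visible in the dataset
events <- data.frame(
  vehicleId = c("v1", "v2", "v3", "v4"),
  type      = c("reserved", "unavailable", "removed", "available"),
  stringsAsFactors = FALSE
)

# Named-vector lookup; unname() drops the lookup keys from the result
events$status <- unname(status_lookup[events$type])
events
```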

We next set time periods for our rebalancing audits. We decided to audit the vehicle fleet at 7AM once a week (every Thursday) for 13 months, from November 15th, 2018 to December 15th, 2019. We chose 7AM because our exploratory analysis revealed that most rebalancing activity occurs in the nighttime and early morning hours.
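A weekly audit schedule like this one can be generated with `seq()`. This is a minimal sketch; the name `audit_times` is hypothetical, and the UTC time zone is an assumption made here so that the weekly spacing is not affected by daylight-saving shifts.

```r
# Weekly audit times at 7 AM, from November 15th, 2018 through December 15th, 2019
# (UTC is an assumption; the time zone used in the analysis is not stated here)
audit_times <- seq(from = as.POSIXct("2018-11-15 07:00:00", tz = "UTC"),
                   to   = as.POSIXct("2019-12-15 07:00:00", tz = "UTC"),
                   by   = "1 week")

length(audit_times)
range(audit_times)
```

Stepping by one week from the start date yields 57 audit times, the last of which falls on December 12th, 2019.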

We then define a function that, for each vehicle, extracts its most recent status before each of our 57 selected audit times and keeps it only if that status is Active. We also drop any scooter whose most recent Active status occurred more than 10 days before the audit time. We assume that these scooters have been removed from the active vehicle fleet without a corresponding status change record.

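library(dplyr); library(purrr); library(sf)
library(future); library(furrr)
plan(multisession) ## FOR PARALLEL PROCESSING (plan(multiprocess) is deprecated in recent versions of future)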

# Return the latest status of a single vehicle as of `datetime`.
# trip_dat: all events for one vehicle; buffer: maximum age (in days) of the
# latest event; Astatus: the `reason` values that count as Active.
LV_extract_latest_status2 <- function(trip_dat, datetime, buffer, 
                                      Astatus = LV_active_status){
  time <- as.POSIXct(datetime)
  tmp <- trip_dat[which(trip_dat$occurredAt <= time),]
  # first pass: if any events precede the audit time, keep only the latest one
  if(nrow(tmp) > 0) {
    tmp <- tmp[order(tmp$occurredAt),]
    tmp <- tmp[nrow(tmp),]
    # measure the event's age explicitly in days; a bare `time - occurredAt`
    # picks its own units (secs, mins, hours, or days)
    tmp <- tmp[as.numeric(difftime(time, tmp$occurredAt, units = "days")) <= buffer,]
    tmp <- tmp[tmp$reason %in% Astatus,] 
  }
  # second pass: if a row survived the filters, the scooter is still active
  if(nrow(tmp) > 0) {
    output <- tmp
    output$Date <- as.Date(output$occurredAt)
    output$Hour <- lubridate::hour(output$occurredAt)
    output$active <- 1
    output <- output[,c("vehicleId", "Date", "Hour", 
                        "operators", "active", "long", "lat")]
  } else { # if the scooter is "unavailable"
    output <- data.frame(vehicleId = trip_dat$vehicleId[1],
                         Date = as.Date(time),
                         Hour = lubridate::hour(time),
                         operators = trip_dat$operators[1],
                         active = 0,
                         long = NA_real_,
                         lat = NA_real_,
                         stringsAsFactors = FALSE)
  }
  return(output)
}

new_func_parallel <- function(...){
  rebal_lst <- LV_rebal_sf %>% 
    mutate(long = st_coordinates(.)[,1], 
           lat = st_coordinates(.)[,2]) %>%
    st_drop_geometry() %>%
    split(.$vehicleId)
  
  LV_rebal_sf_list_i <- future_map(time_intervals,
                                   function(x) map(rebal_lst, function(y){LV_extract_latest_status2(y, x, 10)}) %>%
                                     bind_rows() %>% 
                                     mutate(audit_date = x), .progress = TRUE) %>% 
    bind_rows()
}

new_results_parallel <- new_func_parallel() # same as LV_rebal_sf_list

glimpse(new_results_parallel)
## Observations: 173,736
## Variables: 8
## $ vehicleId  <chr> "2411d395-04f2-47c9-ab66-d09e9e3c3251-1-c6-w5", "2411d39...
## $ Date       <date> 2018-11-15, 2018-11-15, 2018-11-15, 2018-11-15, 2018-11...
## $ Hour       <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,...
## $ operators  <fct> Bird Louisville, Bird Louisville, Bird Louisville, Bird ...
## $ active     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ long       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ lat        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ audit_date <dttm> 2018-11-15 07:00:00, 2018-11-15 07:00:00, 2018-11-15 07...

Next, we aggregate the available scooters across the distribution zones. We determine whether an operator is in compliance with the distribution requirements based on the percentage of the vehicle fleet in zones 1, 8, and 9 and the total size of that vehicle fleet at that time. The two sets of distribution requirements only apply to operators permitted to operate over 150 and over 350 scooters in the city, respectively. While Bird and Lime are each permitted to deploy 450 vehicles now, we were unable to determine when their vehicle limits were raised past the 150- and 350-vehicle thresholds. As a proxy for vehicle limit, we instead use the total size of their fleets as reflected in the dataset (scooter_total). We acknowledge, however, that this may underestimate the two companies’ permitted fleet size at the time, as the maximum active fleet size we calculated during our 57 audits was 339, far short of their 450-vehicle maxes.

LV_rebal_sf_list_2 <- new_results_parallel %>% 
  filter(!is.na(long),
         !is.na(lat)) %>% 
  st_as_sf(coords = c("long", "lat"), crs = LV_proj, remove = FALSE) %>% 
  st_join(., LV_distro_areas %>% dplyr::select(Dist_Zone)) %>% 
  st_drop_geometry() %>% 
  mutate(Dist_Zone = factor(Dist_Zone,
                            levels = paste(1:9)))

LV_rebal_sf_list_summary <- new_results_parallel %>% 
  left_join(LV_rebal_sf_list_2 %>% dplyr::select(vehicleId, Dist_Zone, audit_date), by = c("vehicleId", "audit_date")) %>% 
  group_by(audit_date, Dist_Zone, operators, .drop = FALSE) %>% 
  summarize(scooters = n()) %>% 
  filter(str_detect(operators, "Bird|Lime"),
         !is.na(Dist_Zone)) %>%
  ungroup() %>%
  group_by(audit_date, operators) %>%
  mutate(scooter_total = sum(scooters),
         scooter_pct = scooters / scooter_total)

LV_rebal_sf_list_summary_2 <- LV_rebal_sf_list_summary %>% 
  dplyr::select(-scooter_pct) %>% 
  spread(Dist_Zone, scooters, sep = "_") %>% 
  mutate(Dist_8_pct = ifelse(is.na(Dist_Zone_8 / scooter_total), 0, Dist_Zone_8 / scooter_total), 
         Dist_1_9_pct = ifelse(is.na((Dist_Zone_1 + Dist_Zone_9) / scooter_total), 0, (Dist_Zone_1 + Dist_Zone_9) / scooter_total),
         compliance = case_when(scooter_total > 150 & Dist_1_9_pct < 0.2 ~ "No",
                                scooter_total > 350 & (Dist_1_9_pct < 0.2 | Dist_8_pct < 0.1) ~ "No",
                                TRUE ~ "Yes"))

LV_rebal_sf_list_summary_map <- LV_rebal_sf_list_summary %>% 
  ungroup() %>% 
  group_by(Dist_Zone, operators) %>% 
  summarize(scooter_pct = mean(scooter_pct, na.rm = TRUE)) %>% 
  left_join(LV_distro_areas, by = "Dist_Zone") %>% 
  st_as_sf() %>% 
  arrange(operators)

LV_rebal_sf_list_summary_2_map <- LV_rebal_sf_list_summary_2 %>% 
  gather(dist_zone, dist_pct, Dist_8_pct:Dist_1_9_pct) %>% 
  mutate(requirement = case_when(dist_zone == "Dist_8_pct" ~ 0.1,
                                 dist_zone == "Dist_1_9_pct" ~ 0.2,
                                 TRUE ~ NA_real_),
         dist_zone = factor(case_when(dist_zone == "Dist_8_pct" ~ "Dist_8_pct",
                                      dist_zone == "Dist_1_9_pct" ~ "Dist_1_9_pct",
                                      TRUE ~ NA_character_),
                            levels = c("Dist_8_pct", "Dist_1_9_pct"),
                            labels = c("Zone 8", "Zone 1 & 9")))

LV_rebal_sf_list_summary_2_map
## # A tibble: 228 x 16
## # Groups:   audit_date, operators [228]
##    audit_date          operators scooter_total Dist_Zone_1 Dist_Zone_2
##    <dttm>              <fct>             <int>       <int>       <int>
##  1 2018-11-15 07:00:00 Bird Lou~             0           0           0
##  2 2018-11-15 07:00:00 Lime Lou~            95           1          73
##  3 2018-11-22 07:00:00 Bird Lou~             0           0           0
##  4 2018-11-22 07:00:00 Lime Lou~            65           1          15
##  5 2018-11-29 07:00:00 Bird Lou~             0           0           0
##  6 2018-11-29 07:00:00 Lime Lou~           105           1          52
##  7 2018-12-06 07:00:00 Bird Lou~             0           0           0
##  8 2018-12-06 07:00:00 Lime Lou~           107           1          40
##  9 2018-12-13 07:00:00 Bird Lou~             0           0           0
## 10 2018-12-13 07:00:00 Lime Lou~            74           0          38
## # ... with 218 more rows, and 11 more variables: Dist_Zone_3 <int>,
## #   Dist_Zone_4 <int>, Dist_Zone_5 <int>, Dist_Zone_6 <int>, Dist_Zone_7 <int>,
## #   Dist_Zone_8 <int>, Dist_Zone_9 <int>, compliance <chr>, dist_zone <fct>,
## #   dist_pct <dbl>, requirement <dbl>

Below, we chart the percentage of each operator’s vehicle fleet that could be found in the three distribution zones at the time of each audit. The red line on each chart indicates the minimum percentage of the vehicle fleet that must be located in each zone to comply with the distribution requirements.

While we emphasize again that we are not sure when the distribution requirements took effect for Bird and Lime, we can see that even in the later months of 2019, when we can assume their vehicle fleet limits were near their current 450, there are only two instances where either company was in compliance with at least one of the requirements.
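The compliance check described above amounts to comparing each fleet share with the minimum for its zone group. The following is a minimal sketch in base R, not the code used in the analysis: the audit values are invented, `meets_requirement`, `any_met`, and `all_met` are hypothetical column names, and we assume a share exactly equal to the minimum counts as compliant.

```r
# Toy audit records: the values are invented for illustration only
audits <- data.frame(
  audit_date  = as.Date(c("2019-10-03", "2019-10-03", "2019-10-10", "2019-10-10")),
  operators   = c("Lime", "Lime", "Lime", "Lime"),
  dist_zone   = c("Zone 8", "Zone 1 & 9", "Zone 8", "Zone 1 & 9"),
  dist_pct    = c(0.04, 0.22, 0.12, 0.15),
  requirement = c(0.10, 0.20, 0.10, 0.20)
)

# Assumption: a requirement is met when the share is at least the minimum
audits$meets_requirement <- audits$dist_pct >= audits$requirement

# Count, per audit and operator, how many of the two requirements were met
n_met <- aggregate(meets_requirement ~ audit_date + operators,
                   data = audits, FUN = sum)
n_met$any_met <- n_met$meets_requirement >= 1
n_met$all_met <- n_met$meets_requirement == 2
print(n_met)
```

In this toy data each audit meets exactly one of the two requirements, so `any_met` is true and `all_met` is false for both dates.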

5. Model Building

5.1 Modeling Strategy

We used the final set of features to construct several models that predict raw trip counts in each census tract for the cities in our study. We employed the following modeling frameworks:

  • Linear Model: an OLS linear regression
  • Penalized Linear Model: a linear regression model that uses regularization to prevent overfitting to insignificant predictor variables. During model tuning, we tested L1, L2, and elastic net regularization.
  • Random Forest: an ensemble method that aggregates predictions from many decision trees for classification or regression tasks. It employs bagging (bootstrap aggregation) to protect against overfitting the decision trees to the training data.
  • XGBoost: another tree-based ensemble method. Unlike random forest, which grows many trees independently and aggregates their results at the end, this method builds trees iteratively through boosting: each new tree is fit to correct the large prediction errors left by the previous trees.
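To make the contrast between bagging and boosting in the list above concrete, here is a minimal sketch in base R using one-split regression trees ("stumps") on simulated data. It is an illustration of the two ideas only, not random forest or XGBoost themselves; the data, the helper functions `fit_stump` and `predict_stump`, and the settings `n_trees` and `learning_rate` are all assumptions made for the example.

```r
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Fit a one-split regression stump: choose the threshold with the lowest
# squared error among a few candidate quantiles of x
fit_stump <- function(x, y) {
  candidates <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
  sse <- sapply(candidates, function(s) {
    left <- y[x <= s]
    right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- candidates[[which.min(sse)]]
  list(split = s, left = mean(y[x <= s]), right = mean(y[x > s]))
}
predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

n_trees <- 50

# Bagging: fit each stump independently on a bootstrap sample, then average
bag_preds <- sapply(seq_len(n_trees), function(i) {
  idx <- sample(length(x), replace = TRUE)
  predict_stump(fit_stump(x[idx], y[idx]), x)
})
bagged <- rowMeans(bag_preds)

# Boosting: fit each stump to the residuals left by the stumps before it
learning_rate <- 0.3
boosted <- rep(mean(y), length(y))
for (i in seq_len(n_trees)) {
  resid <- y - boosted
  boosted <- boosted + learning_rate * predict_stump(fit_stump(x, resid), x)
}

# Training error of each ensemble
print(c(bagged = rmse(y, bagged), boosted = rmse(y, boosted)))
```

The bagged stumps never see one another's mistakes, while each boosted stump is fit specifically to what the earlier stumps got wrong.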

The graphic below helps to visualize our model selection process. First, we split the data into a 75% training set and a 25% test set, stratified across the six cities. Within the training set, we then used a grid search to select the optimal hyperparameters for each modeling framework. To improve the robustness of the grid search, we used both random k-fold (across 20 folds) and leave-one-group-out (leaving one city out at a time) cross-validation. In this cross-validation process, each fold or city within the training set is held out in turn while the grid search is run over the remaining folds or cities. Predictions are then produced on the held-out fold or city, and the root mean squared error (RMSE) is calculated for those predictions (the out-of-fold error). This process is repeated for every fold and city, and for each modeling framework we select the hyperparameters that minimize the average RMSE across folds. Finally, we test these models on the 25% test set and select the framework with the lowest mean absolute error (MAE).
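The selection process described above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the project's actual pipeline: the tract data are simulated, the columns `jobs` and `pop_dens` are hypothetical, `lm` stands in for the four frameworks, two model formulas stand in for a hyperparameter grid, and averaging the k-fold and leave-one-city-out RMSEs with equal weight is our own simplification.

```r
set.seed(801)

# Toy tract-level data: all values and column names are invented
tracts <- data.frame(
  city     = rep(c("Austin", "Chicago", "Louisville"), each = 40),
  jobs     = runif(120, 0, 10),
  pop_dens = runif(120, 0, 5)
)
tracts$trips <- 50 + 30 * tracts$jobs + 10 * tracts$pop_dens + rnorm(120, sd = 20)

# 75%/25% split, stratified by city: sample 75% of the rows within each city
train_idx <- unlist(lapply(
  split(seq_len(nrow(tracts)), tracts$city),
  function(rows) sample(rows, size = round(0.75 * length(rows)))
))
train <- tracts[train_idx, ]
test  <- tracts[-train_idx, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))

# Hold out each group in turn, fit on the rest, and average the hold-out RMSEs
cv_rmse <- function(data, groups, model_formula) {
  errs <- sapply(unique(groups), function(g) {
    fit <- lm(model_formula, data = data[groups != g, ])
    held_out <- data[groups == g, ]
    rmse(held_out$trips, predict(fit, newdata = held_out))
  })
  mean(errs)
}

# Random k-fold (20 folds) and leave-one-city-out group assignments
kfold_groups <- sample(rep(seq_len(20), length.out = nrow(train)))
logo_groups  <- train$city

# Two candidate specifications standing in for a hyperparameter grid
candidates <- list(simple = trips ~ jobs, full = trips ~ jobs + pop_dens)
cv_results <- sapply(candidates, function(f) {
  c(kfold = cv_rmse(train, kfold_groups, f),
    logo  = cv_rmse(train, logo_groups, f))
})
print(cv_results)

# Refit the candidate with the lowest average CV RMSE, then score it on the test set
best <- names(which.min(colMeans(cv_results)))
final_fit <- lm(candidates[[best]], data = train)
test_mae <- mae(test$trips, predict(final_fit, newdata = test))
print(c(best = best, test_mae = round(test_mae, 1)))
```

The test set is touched only once, after the cross-validation has picked a candidate, so the reported MAE is not used for tuning.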

5.2 Model Evaluation

The plots below show the average MAE and RMSE of the out-of-group predictions made with the optimized hyperparameters for each modeling framework. Interestingly, the random forest model had the best performance in terms of MAE but performed poorly in terms of RMSE. Our interpretation is that the random forest model predicted well for many census tracts compared to the other frameworks, but a few predictions with very large errors drove up the RMSE, which is more sensitive to outliers than MAE. In general, the errors were quite high. This reflects the difficulty of using a fairly small sample of cities that vary greatly to make predictions of raw ridership counts.
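A small worked example shows how a model can rank best on MAE and worst on RMSE. The two error vectors below are hypothetical and are not drawn from our results; they only illustrate how squaring the errors lets a single large miss dominate the RMSE.

```r
rmse <- function(err) sqrt(mean(err^2))
mae  <- function(err) mean(abs(err))

# Hypothetical tract-level errors (in trips) for two models
many_small_one_large <- c(rep(10, 9), 410)  # accurate almost everywhere, one big miss
uniformly_moderate   <- rep(60, 10)         # the same moderate error everywhere

mae(many_small_one_large)   # 50
rmse(many_small_one_large)  # 130
mae(uniformly_moderate)     # 60
rmse(uniformly_moderate)    # 60
```

The first model wins on MAE (50 versus 60) but loses badly on RMSE (130 versus 60), which is the pattern described above for the random forest.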

Average RMSE and MAE for Out-of-fold Predictions

Below, we present scatter plots showing the out-of-fold and out-of-city predictions and errors for each method. We see that errors were larger in general for the out-of-city predictions, again demonstrating the challenge of generalizing ridership predictions across cities with such different characteristics.

We next plot the MAEs for each model’s predictions on the 25% test set. While we used RMSE for optimizing the models in an attempt to limit extreme outliers, we choose to present MAEs below because they are easier to interpret.

MAE for Test Set Predictions

Below, we show an example of the out-of-fold predictions and prediction errors made on Austin when it was treated as a hold-out group in cross-validation. Note that the “missing” census tracts in the map below are the tracts that had been randomly selected for the 25% test set. We see that ridership was drastically underestimated in a couple of downtown census tracts.

%%[Map of Austin predictions]%%

We summarize the errors generated by our final random forest model within each city and plot them below. The errors vary significantly across the cities. The median error ranges from a low of 80 in Chicago to a high of over 700 in Louisville.
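A per-city summary of this kind can be sketched in base R as follows. The values are invented and do not reproduce the figures above, the column names are hypothetical, and we assume that "error" here means the absolute difference between observed and predicted trips.

```r
# Hypothetical test-set results: values are invented for illustration
results <- data.frame(
  city      = c("Chicago", "Chicago", "Chicago",
                "Louisville", "Louisville", "Louisville"),
  observed  = c(120, 40, 300, 900, 2500, 150),
  predicted = c(100, 90, 380, 400, 1200, 700)
)

# Assumption: error is measured as the absolute difference in trips
results$abs_error <- abs(results$observed - results$predicted)

# Median absolute error per city
median_error_by_city <- tapply(results$abs_error, results$city, median)
print(median_error_by_city)
```

With these toy values the median error is 50 for Chicago and 550 for Louisville.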

The model also varies greatly in performance across racial contexts. Below, we see variations in error between majority white and majority non-white census tracts in the cities in our study, suggesting that our model tends to underpredict trips in majority white neighborhoods while overpredicting trips in majority non-white areas.

%%[Table with comparison across race contexts]%%
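Under- and overprediction are a matter of the sign of the error, so a comparison like this one rests on signed rather than absolute errors. The sketch below uses invented values and hypothetical column names, and the sign convention (predicted minus observed) is our assumption.

```r
# Hypothetical tract-level results: values are invented for illustration
results <- data.frame(
  race_context = c("Majority white", "Majority white",
                   "Majority non-white", "Majority non-white"),
  observed  = c(500, 300, 100, 200),
  predicted = c(350, 250, 180, 260)
)

# Assumed convention: positive means overprediction, negative means underprediction
results$error <- results$predicted - results$observed

# Mean signed error by racial context
mean_error <- tapply(results$error, results$race_context, mean)
print(mean_error)
```

In this toy data the mean signed error is -100 trips in majority white tracts (underprediction) and 70 trips in majority non-white tracts (overprediction).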

5.3 Predictions

Finally, we use our model to produce predictions on 10 cities without scooter share systems. As we would expect, most of the predicted rides are concentrated within the cities’ central business districts and areas near universities. In Philadelphia, for example, we see high predicted ridership in Center City, as well as the neighborhoods around the University of Pennsylvania, Drexel University, and Temple University. However, we also see high predictions in outlying census tracts for some cities. In Madison, for instance, we see very high predictions for ridership at the western periphery of the city.

These plots can be explored interactively in the web application we’ve built for this project, which is further described in Section 7.

Predicted Ridership by City

Asheville, North Carolina

Hartford, Connecticut

Houston, Texas

Jersey City, New Jersey

Jacksonville, Florida

Madison, Wisconsin

Omaha, Nebraska

Philadelphia, Pennsylvania

San Antonio, Texas

Syracuse, New York

6. Equity of Access

8. Code Appendix